Statistical models for improving the performance of database operations

ABSTRACT

The invention relates to a method for the automatic, software-driven statistical evaluation of large amounts of data that is to be assigned to statistical variables in a database. Said method is characterised in that a statistical model, which approximately describes the relative frequencies of the states of the variables and the statistical dependencies between said states, is learnt and is used to determine the approximate relative frequencies of states of the variables, in addition to the approximate relative frequencies belonging to predeterminable relative frequencies of states of the variables and expectation values of the states of variables dependent thereon.

[0001] This invention relates to a method for the automatic, software-driven statistical evaluation of large amounts of data that is to be assigned to statistical variables in a database. The data to be evaluated can, in particular, be contained in one or several clusters.

[0002] Nowadays databases are able to store immense amounts of data. In order to evaluate the stored data and extract useful information from it, efficient, i.e. quick and specific, database accesses are required because of the volume of data.

[0003] In general, an evaluation requires finding all the data that conforms to a pre-determinable condition. Often the located data itself need not be known; only statistical information derived from that data is required.

[0004] If, for example, in a customer relationship management (CRM) system in which customer data is stored, it is to be determined what proportion of customers with specific features bought a certain product, a simple procedure would be to access all the customer entries in the database, request all the features of the customers and then find and count those entries that match the desired features and for which the customers bought the specific product. For example, such a request to the database could be as follows: how often were specific mobile telephones purchased by male customers who are at least 30 years old? All the customer entries that conform to the requirements “male” and “at least 30 years old” must therefore be found, and the matching entries must then be tested to determine which mobile telephone was purchased most often.

[0005] However, a disadvantage of this procedure is that the entire database has to be read to find the matching entries. This can take a very long time in the case of very large databases.

[0006] The database can be searched more skillfully and more efficiently if all the variables are provided with selective indexes that can be queried. As a rule, the more exact and sophisticated the index technique applied to a database is, the more quickly the database can be accessed. Statistical information about the database entries can accordingly also be provided more efficiently. This applies in particular if the database is specifically prepared, by a special index technique, for the requests to be expected.

[0007] Alternatively, or in combination with index techniques, the results of all the statistical requests to be expected can be pre-calculated, which has the disadvantage of the considerable effort required for calculating and storing the results.

[0008] The term “online analytical processing” (OLAP) characterizes a class of methods for extracting statistical information from the data of a database. In general, such methods can be subdivided into “relational online analytical processing” (ROLAP) and “multidimensional online analytical processing” (MOLAP).

[0009] The ROLAP method makes only slight pre-calculations. When statistics are requested, the data required for a response to the request is accessed via the index techniques and the statistics are then calculated from that data. The emphasis of ROLAP is therefore on a suitable organization and indexing of the data so that the required data can be found and loaded as quickly as possible. Nevertheless, the effort for large amounts of data can still be very great, and in addition the selected indexing is sometimes not optimal for all the requests.

[0010] In the MOLAP method the focus is on pre-calculating the results for many possible requests. As a result, the response time for a pre-calculated request remains very short. For requests that have not been pre-calculated, the pre-calculated values can sometimes also lead to an acceleration if the desired quantities can be calculated from the pre-calculated results and this is more cost-effective than directly accessing the data. The number of all possible requests increases rapidly with the increasing number of states of the variables, so that the pre-calculation runs up against the current limits of memory space and turnaround time. Restrictions with regard to the variables considered, the different states of these variables or the permissible requests must then be accepted.

[0011] Even though the OLAP method guarantees an increase in efficiency compared to merely accessing each database entry, it is disadvantageous that a great amount of redundant information has to be generated: statistics must be pre-calculated and extensive index lists created. In general, an efficient application of an OLAP method also requires that the method be optimized for specific requests, in which case the OLAP method is then also subject to these selected restrictions, i.e. arbitrary requests cannot be made to the database.

[0012] In addition, it is also true for the OLAP method that the more quickly the information is to be provided and the more this information varies, the more structures must be pre-calculated and stored. OLAP systems can therefore become very large and are far less efficient than would be desired. In practice, response times of less than one second cannot be achieved for arbitrary statistical requests to a large database. Often the response times are considerably more than one second.

[0013] Therefore, there is a need for more efficient methods for the statistical evaluation of data entries. If possible, the requests should not be subject to any restrictions.

[0014] The object of this invention is to overcome the disadvantages of the methods known in the prior art, particularly the OLAP method, for the statistical evaluation of database entries.

[0015] The methods according to the features of the independent claims achieve this object according to the invention. Advantageous developments of the invention are specified in the subclaims.

[0016] According to the invention, a method is shown for the automatic, software-driven statistical evaluation of large amounts of data that is to be assigned to statistical variables in a database and is, in particular, contained in one or more clusters. The method is characterized in that a statistical model for the approximate description of the relative frequencies of the states of the variables and the statistical dependencies between said states is learnt by means of the data stored in the database and is used to determine, on the basis of the statistical model, the approximate relative frequencies of states of the variables, in addition to the approximate relative frequencies belonging to the pre-determinable relative frequencies of states of the variables and expected values of the states of variables dependent thereon.

[0017] Unlike the conventional methods for statistical evaluation of the data from databases, the model is not an exact image of the statistics of the data. In general, this procedure yields not exact but only approximate statistical statements. However, the statistical models are subject to fewer restrictions than, for example, the conventional OLAP methods.

[0018] In order to make approximate statistical statements, the entries in a database are “condensed” into a statistical model, in which case the statistical model represents an approximation of the “joint probability distribution” of the database entries. In practice, this takes place by learning the statistical model on the basis of the database entries, as a result of which the relative frequencies of the states of the variables of the database entries can be described approximately. The variables can assume many states with different relative frequencies. As soon as such a statistical model is available, it can be used to study the dependencies between the states of the variables. According to a pre-determinable condition, relative frequencies of the states of the variables can be specified in this way and are used to determine the relative frequencies of states of the variables belonging to predetermined relative frequencies of states of the variables dependent thereon.

[0019] A statistical request to the database can in this way be formulated as a condition on the relative frequencies of specific states of the variables, in which case, as a response to the statistical request, the relative frequencies of the states of the variables dependent thereon that belong to these predetermined relative frequencies are determined.

[0020] As the statistical model, a graphical probabilistic model is preferably used (see e.g.: Castillo, Jose Manuel Gutierrez, Ali S. Hadi, Expert Systems and Probabilistic Network Models, Springer, N.Y.). The graphical probabilistic models particularly include Bayesian networks (also called belief networks) and Markov networks.

[0021] A statistical model can, for example, be generated by structure learning in Bayesian networks (see e.g.: Reimar Hofmann, Lernen der Struktur nichtlinearer Abhängigkeiten mit graphischen Modellen (learning the structure of non-linear dependencies with graphical models), Dissertation, Berlin, or David Heckerman, A Tutorial on Learning with Bayesian Networks, Technical Report MSR-TR-95-06, Microsoft Research).

[0022] A further possibility is to learn the parameters for a fixed structure (see e.g.: Martin A. Tanner: Tools for Statistical Inference, Springer N.Y., 1996).

[0023] Many learning methods use the likelihood function as an optimization criterion for the parameters of the model. A particular embodiment here is the expectation maximization (EM) learning method, which is explained below in detail on the basis of a special model. In principle, the generalization ability of the models is not the main concern here; it is only necessary to obtain a good fit of the models to the data.

[0024] As the statistical model, a statistical clustering model, preferably a Bayesian clustering model, is used by means of which the data is subdivided into many clusters.

[0025] Similarly, a clustering model based on a distance measurement can be used together with a statistical model, by means of which the data is likewise subdivided into many clusters.

[0026] By using clustering models, a very large database breaks down into smaller clusters that, for their part, can be interpreted as separate databases and can be handled more efficiently because of their comparatively smaller size. During the statistical evaluation of the database, it is tested whether or not a predetermined condition can be mapped via the statistical model to one or more clusters. If so, the evaluated data is restricted to that cluster or those clusters. Similarly, it is possible to restrict the evaluation to those clusters in which the data conforming to the predetermined condition has at least a specific relative frequency. The remaining clusters, which contain only a small amount of data conforming to the predetermined condition, can be ignored because the procedure under consideration aims only at approximate statements.

[0027] For example, a Bayesian clustering model (a model with a discrete latent variable) is used as a statistical clustering model.

[0028] This is described in further detail below:

[0029] Let there be given a set of statistical variables {A, B, C, D, . . . }, or in other words, a set of fields in a database table. The corresponding lower case letters describe the states of the variables. Variable A can therefore assume the states {a₁, a₂, . . . }. The states are assumed to be discrete; but in general continuous (real-valued) variables are also permitted.

[0030] An entry in the database table consists of values for all the variables, in which case the values belonging to an entry are combined into one data record. For example, x^(π)=(a^(π), b^(π), c^(π), d^(π), . . . ) describes the πth data record. The table has M entries, and the set of all data records is denoted D, i.e. D={x^(π), π=1, . . . , M}.

[0031] In addition, there is a hidden variable (cluster variable) that is designated Ω. The cluster variable can assume the values {ω_(i), i=1, . . . , N}; i.e. there are N clusters.

[0032] Here, P(Ω|θ) describes the a priori distribution of the clusters, in which case the a priori weight of the ith cluster is given by P(ω_(i)|θ) and θ represents the parameters of the model. The a priori distribution describes what proportion of the data is assigned to each cluster.

[0033] The expression P(A, B, C, D, . . . |ω_(i), θ) describes the structure of the ith cluster, i.e. the conditional distribution of the variables of the variable set {A, B, C, D, . . . } within the ith cluster.

[0034] The a priori distribution and the conditional distributions of each cluster thus together parameterize one joint probabilistic model on {A, B, C, D, . . . } ∪ {Ω} or on {A, B, C, D, . . . }. The probabilistic model is given by the product of the a priori distribution and the conditional distribution

P(A,B,C, . . . ,Ω|Θ)=P(Ω|Θ) P(A,B,C, . . . |Ω,Θ),

[0035] or by

P(A,B,C, . . . |Θ)=Σ_(i) P(ω_(i)|Θ) P(A,B,C, . . . |ω_(i),Θ).

[0036] The logarithmic likelihood function L of the parameters θ for the data set D is now given by

L(Θ)=log P(D|Θ)=Σ_(π) log P(x^(π)|Θ).

[0037] Within the context of the expectation maximization (EM) theory, a sequence of parameters θ^((t)) is now constructed according to the following general specification:

Θ^((t+1))=arg max_(Θ) Σ_(π) Σ_(i) P(ω_(i)|x^(π),Θ^((t))) log P(x^(π),ω_(i)|Θ)

[0038] This iteration specification maximizes the likelihood function step by step.

[0039] For the conditional distributions P(A, B, C, D, . . . |ω_(i), θ), restrictive assumptions can (and must, in general) be made. An example of such a restrictive assumption is the following factorization assumption:

[0040] If, for example, for the conditional distribution P(A, B, C, D, . . . |ω_(i), θ) of the variables of the variable set {A, B, C, D, . . . }, the factorization

P(A, B, C, D, . . . |ω_(i), θ)=P(A|ω_(i), θ) P(B|ω_(i), θ) P(C|ω_(i), θ) P(D|ω_(i), θ) . . .

is assumed, the probabilistic model corresponds to a naive Bayesian network. Instead of one high-dimensional table, one is now only confronted with many one-dimensional tables (one table for each variable).
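As an illustration of the factorization assumption, the following minimal Python sketch evaluates the joint model from an a priori table and one-dimensional conditional tables. The parameter values, the variable names and the functions `joint` and `marginal` are hypothetical and chosen only for this example; they are not taken from the embodiment.

```python
from math import prod

# Hypothetical toy parameters (assumption): 2 clusters,
# variable A with 2 states and variable B with 3 states.
prior = [0.6, 0.4]  # prior[i] = P(omega_i | theta)
cond = {            # cond[var][i][state] = P(var = state | omega_i, theta)
    "A": [[0.9, 0.1], [0.2, 0.8]],
    "B": [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
}

def joint(record, i):
    """P(record, omega_i | theta) under the factorization assumption."""
    return prior[i] * prod(cond[var][i][state] for var, state in record.items())

def marginal(record):
    """P(record | theta), i.e. the sum over all clusters of the joint."""
    return sum(joint(record, i) for i in range(len(prior)))

# A record maps variable names to state indices; variables that are
# left out are marginalized implicitly because each table sums to 1.
p_full = marginal({"A": 0, "B": 0})
p_partial = marginal({"A": 0})
```

Because every one-dimensional table is normalized, a request that fixes only some of the variables needs no explicit summation over the remaining ones, which is one reason the factorized form is cheap to evaluate.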

[0041] The parameters of the distribution can, as shown above, be learnt from the data with an expectation maximization (EM) learning method. A cluster can be assigned to each data record x^(π)=(a^(π), b^(π), c^(π), d^(π), . . . ) after the learning process. The assignment takes place via the a posteriori distribution P(Ω|a^(π), b^(π), c^(π), d^(π), . . . , θ), in which case the cluster ω_(i) with the highest weight P(ω_(i)|a^(π), b^(π), c^(π), d^(π), . . . , θ) is assigned to the data record x^(π).
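The learning and assignment steps can be sketched in Python as follows. This is a minimal sketch for discrete variables under the factorization assumption, not the implementation used in the embodiment; the function names, the random initialization, the fixed number of iterations and the small smoothing constant are assumptions made for the example.

```python
import random

def em_naive_bayes(records, n_states, n_clusters, n_iter=50, seed=0):
    """EM for a naive Bayesian clustering model with discrete variables.

    records  -- list of tuples of state indices, one tuple per data record
    n_states -- number of states of each variable
    Returns (prior, cond, resp) with cond[i][v][s] = P(variable v = s | omega_i)
    and resp[k][i] = P(omega_i | record k) from the last E-step.
    """
    rng = random.Random(seed)
    n_vars = len(n_states)

    def random_table(k):
        weights = [rng.random() + 0.1 for _ in range(k)]
        total = sum(weights)
        return [w / total for w in weights]

    prior = [1.0 / n_clusters] * n_clusters
    cond = [[random_table(n_states[v]) for v in range(n_vars)]
            for _ in range(n_clusters)]
    resp = []

    for _ in range(n_iter):
        # E-step: a posteriori cluster weights for every record.
        resp = []
        for x in records:
            weights = []
            for i in range(n_clusters):
                p = prior[i]
                for v, s in enumerate(x):
                    p *= cond[i][v][s]
                weights.append(p)
            z = sum(weights)
            resp.append([w / z for w in weights])

        # M-step: re-estimate the tables from the weighted counts.
        for i in range(n_clusters):
            prior[i] = sum(r[i] for r in resp) / len(records)
            for v in range(n_vars):
                counts = [1e-9] * n_states[v]  # smoothing (assumption)
                for x, r in zip(records, resp):
                    counts[x[v]] += r[i]
                total = sum(counts)
                cond[i][v] = [c / total for c in counts]
    return prior, cond, resp

def hard_assignment(resp):
    """Assign each record to the cluster with the highest a posteriori weight."""
    return [max(range(len(r)), key=r.__getitem__) for r in resp]
```

A call such as `em_naive_bayes(records, [2, 2], 2)` followed by `hard_assignment(resp)` yields one cluster index per record, which can then be stored with the record.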

[0042] The cluster affiliation of each entry in the database can be stored as an additional field in the database, and corresponding indexes can be prepared to quickly access the data that belongs to a specific cluster.
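Storing the cluster affiliation as an indexed field might look as follows. The sketch uses Python's built-in `sqlite3` module with an in-memory database; the table name `sessions`, its columns and the index name are hypothetical and serve only as an example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table (assumption): one row per entry, with the learnt
# cluster affiliation stored as an additional integer field.
conn.execute(
    "CREATE TABLE sessions (id INTEGER PRIMARY KEY, referrer TEXT, cluster INTEGER)"
)
# Index on the cluster field so that all entries of a cluster are found quickly.
conn.execute("CREATE INDEX idx_sessions_cluster ON sessions (cluster)")

conn.executemany(
    "INSERT INTO sessions (referrer, cluster) VALUES (?, ?)",
    [("none", 0), ("search", 1), ("none", 0)],
)

# Restrict a request to the entries that belong to cluster 0.
rows = conn.execute(
    "SELECT id, referrer FROM sessions WHERE cluster = ? ORDER BY id", (0,)
).fetchall()
```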

[0043] If, for example, a statistical request of the type “give all the data records with A=a₁ and B=b₃ as well as the relevant distribution via C and D (i.e. P(C|a₁, b₃) and P(D|a₁, b₃))” is made to the database, the procedure is as follows:

[0044] First of all, the a posteriori distribution P(Ω|a₁, b₃) is determined. From this distribution it is (approximately) clear what proportion of the data satisfying the set condition is to be found in which clusters of the database. In this way, it is possible in all further processing to restrict oneself, depending on the desired accuracy, to the clusters of the database that have a high a posteriori weight according to P(Ω|a₁, b₃).

[0045] The ideal case is when P(ω_(i)|a₁, b₃)=1 applies for one i and accordingly P(ω_(j)|a₁, b₃)=0 for all j≠i, i.e. all the data corresponding to the set condition lies in one cluster. In such a case, it is possible to restrict oneself to the ith cluster without losing accuracy in further evaluation.
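The a posteriori distribution over the clusters for a condition such as A=a₁ and B=b₃ can be sketched as follows, assuming the naive Bayesian factorization. The parameter values and the function name `cluster_posterior` are hypothetical; states are addressed by zero-based index, so a₁ is index 0 and b₃ is index 2.

```python
from math import prod

# Hypothetical toy parameters (assumption): 3 clusters.
prior = [0.5, 0.3, 0.2]  # P(omega_i)
cond = {                 # cond[var][i][state] = P(var = state | omega_i)
    "A": [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],
    "B": [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.3, 0.4, 0.3]],
}

def cluster_posterior(evidence):
    """P(Omega | evidence) via Bayes' rule; evidence maps variable -> state index."""
    weights = [
        prior[i] * prod(cond[var][i][state] for var, state in evidence.items())
        for i in range(len(prior))
    ]
    z = sum(weights)
    return [w / z for w in weights]

# Condition A = a1 and B = b3.
posterior = cluster_posterior({"A": 0, "B": 2})
```

With these toy values no cluster reaches weight 1, so the ideal case described above does not apply and a restriction to a subset of the clusters would lose some accuracy.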

[0046] In order to obtain (approximate) distributions for C and D, it is possible to carry on using the model, i.e. to determine the desired distributions P(C|a₁, b₃) and P(D|a₁, b₃) approximately from the parameters of the model:

P(C|a₁, b₃)≅Σ_(i) P(C|ω_(i), a₁, b₃, Θ) P(ω_(i)|a₁, b₃, Θ).
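A minimal Python sketch of this approximation is given below. It assumes the naive Bayesian factorization, under which P(C|ω_(i), a₁, b₃, Θ) reduces to P(C|ω_(i), Θ); the parameter values and the function name `conditional` are hypothetical.

```python
from math import prod

# Hypothetical toy parameters (assumption): 2 clusters, variables A, B, C.
prior = [0.5, 0.5]
cond = {  # cond[var][i][state] = P(var = state | omega_i)
    "A": [[0.9, 0.1], [0.2, 0.8]],
    "B": [[0.1, 0.1, 0.8], [0.6, 0.3, 0.1]],
    "C": [[0.7, 0.3], [0.1, 0.9]],
}

def conditional(target, evidence):
    """Approximate P(target | evidence) as a posterior-weighted mixture."""
    n = len(prior)
    weights = [
        prior[i] * prod(cond[var][i][state] for var, state in evidence.items())
        for i in range(n)
    ]
    z = sum(weights)
    posterior = [w / z for w in weights]  # P(omega_i | evidence)
    n_target_states = len(cond[target][0])
    return [
        sum(posterior[i] * cond[target][i][s] for i in range(n))
        for s in range(n_target_states)
    ]

# P(C | A = a1, B = b3), with zero-based state indices.
dist_c = conditional("C", {"A": 0, "B": 2})
```

No database access is needed for this evaluation; only the model tables are read, which is why such a request can be answered quickly regardless of the number of stored entries.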

[0047] Alternatively, however, the model can also be used only to determine the clusters that are relevant for the current request.

[0048] After restricting the request to these clusters, more exact methods can be used within the clusters. For example, the statistics within the clusters can be counted exactly (with the help of an additional index referring to the cluster affiliation, or based on the conventional database reporting method or the OLAP method), or further statistical models adapted to the clusters can be used. A tight interlocking with OLAP is particularly advantageous because the so-called “sparsity” of the data in high dimensions is exploited by the statistical clustering models, and the OLAP methods are used only within the lower-dimensional clusters, where they are effective.
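Exact counting restricted to the relevant clusters could be sketched as follows. The rows, the in-memory index and the function name `count_in_clusters` are hypothetical; in practice the index would be a database index on the cluster field.

```python
from collections import Counter, defaultdict

# Hypothetical rows (assumption): each entry carries its cluster affiliation.
rows = [
    {"A": "a1", "B": "b3", "C": "c1", "cluster": 0},
    {"A": "a1", "B": "b3", "C": "c2", "cluster": 0},
    {"A": "a1", "B": "b3", "C": "c1", "cluster": 0},
    {"A": "a2", "B": "b1", "C": "c2", "cluster": 1},
    {"A": "a1", "B": "b3", "C": "c2", "cluster": 1},
]

# Additional index: cluster -> row ids belonging to that cluster.
index = defaultdict(list)
for row_id, row in enumerate(rows):
    index[row["cluster"]].append(row_id)

def count_in_clusters(kept, condition, target):
    """Exact relative frequencies of `target` among matching rows in `kept` clusters."""
    counts = Counter()
    for cluster in kept:
        for row_id in index[cluster]:
            row = rows[row_id]
            if all(row[var] == state for var, state in condition.items()):
                counts[row[target]] += 1
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()} if total else {}

# Only cluster 0 is read; the matching row in cluster 1 is ignored.
restricted = count_in_clusters([0], {"A": "a1", "B": "b3"}, "C")
# Reading both clusters gives the exact answer for the whole table.
exact = count_in_clusters([0, 1], {"A": "a1", "B": "b3"}, "C")
```

The difference between `restricted` and `exact` in this toy example shows the accuracy that is given up when a cluster containing matching data is excluded.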

[0049] The trade-off between speed and accuracy in the evaluation results from the amount of data excluded from the evaluation: the more clusters are excluded from the evaluation, the quicker, but also the less exact, the response to a statistical request will be. The user can determine the trade-off between accuracy and speed. In addition, more exact methods can be initiated automatically if the accuracy obtained from evaluating the model appears to be insufficient.

[0050] In general, clusters that are below a specific minimum weight are excluded from the evaluation. Exact results can be obtained by excluding only such clusters from the evaluation that have an a posteriori weight of zero. Here, an exact “indexing” of the clusters can be reached as a result of an exact indexing of the database in which case the evaluation is accelerated in many cases. However, in general as many clusters as possible are used for the evaluation.
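The exclusion of clusters below a minimum weight can be sketched as follows. The function name `select_clusters`, the example weights and the threshold value are hypothetical; the returned covered weight is one possible indicator of how much accuracy is given up.

```python
def select_clusters(posterior, min_weight=0.0):
    """Keep clusters whose a posteriori weight exceeds `min_weight`.

    With the default min_weight of 0.0 only clusters with weight zero are
    excluded, so the subsequent evaluation remains exact.
    Returns the kept cluster indices and the a posteriori weight they cover.
    """
    kept = [i for i, w in enumerate(posterior) if w > min_weight]
    covered = sum(posterior[i] for i in kept)
    return kept, covered

# Hypothetical a posteriori weights of four clusters for some request.
weights = [0.7, 0.25, 0.05, 0.0]

fast_kept, fast_covered = select_clusters(weights, min_weight=0.1)
exact_kept, exact_covered = select_clusters(weights)
```

Raising `min_weight` shrinks the set of clusters that must be read and lowers the covered weight, which is the speed versus accuracy trade-off described above.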

[0051] Overtraining a clustering model is not a concern, because the aim is to reproduce historical data as exactly as possible and not to make a prognosis for the future. Moreover, intensely overtrained clustering models tend to supply the most unambiguous possible assignment of requests to clusters, which means that in further operations it is possible to limit the request to small clusters of the database very quickly.

[0052] Advantageously, the data belonging to a cluster is stored on a data carrier in a way appropriate to the cluster affiliation. For example, the data belonging to one cluster can be stored on one section of the hard disk, so that data belonging together can be read more quickly as a block.

[0053] As has already been shown, according to the method of the invention, conventional methods for the statistical evaluation of the data from databases can also be used in a supplementary way if approximate statements are deemed to be insufficient. In particular, conventional database reporting or OLAP methods are used to determine the relative frequencies of the states of the variables.

[0054] A supplementary application of conventional database techniques can, for example, be initiated automatically if a definable test variable reaches or exceeds a predetermined value.

[0055] According to the invention, a method is also shown for the automatic, software-driven statistical evaluation of large amounts of data that is to be assigned to statistical variables in a database and is, in particular, contained in one or several clusters. This method is characterized in that the data is subdivided into many clusters by a clustering model based on distance measurement and, if required, the considered data is restricted to the data contained in one cluster or several clusters, in which case database reporting methods or OLAP methods are used to determine the relative frequencies and expected values of the states of variables.

[0056] The methods shown in the invention can subdivide the data of the database into clusters and, if required, result in a restriction to one cluster or several clusters. If the methods according to the invention are applied to data that is already contained in one cluster or several clusters, the clusters are in this way subdivided into subclusters. If there is to be a restriction to one or more subclusters, the methods according to the invention can be applied to the data contained therein, in which case, if required, more exactly adapted statistical models can be used. In general, this procedure can be repeated as often as desired, i.e. the clusters can be subdivided into subclusters, the subclusters into sub-subclusters, and so on, and, if required, there can be a restriction to the data contained therein in each case, with the methods according to the invention (adapted more exactly) applied to the data contained in the clusters under consideration.

[0057] An embodiment of the invention in the Web reporting/Web mining area is described below with reference to the accompanying drawings.

[0058] FIG. 1 shows different monitor windows in which variables for describing the visitors to a Web site are displayed.

[0059] FIG. 2 shows different monitor windows of the variables of FIG. 1, in which case the behavior of visitors from a specific referrer is investigated.

[0060] FIG. 3 shows different monitor windows of the variables of FIG. 1, in which case the behavior of visitors that call up the homepage first, then read the news and subsequently call up the homepage again is investigated.

[0061] In general, large amounts of data have to be evaluated in the Web reporting/Web mining area. When a user visits a Web site, each action of the user is usually recorded in the Web log file. This is very data-intensive because such Web log files can grow very rapidly to sizes in the region of several gigabytes.

[0062] In order to prepare the evaluation of the Web log files, “sessions” or visits by visitors were extracted, i.e. all the successive entries (page retrievals or clicks) belonging to a visitor were combined.
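The extraction of sessions might be sketched as follows. The input format (visitor identifier, timestamp in seconds, page) and the inactivity timeout used to separate successive visits of the same visitor are assumptions made for this example; the embodiment does not state how sessions were delimited.

```python
from collections import defaultdict

def extract_sessions(entries, timeout=1800):
    """Group log entries into sessions per visitor.

    entries -- iterable of (visitor_id, timestamp_seconds, page)
    timeout -- gap in seconds after which a new session starts (assumption)
    Returns a list of (visitor_id, [(timestamp, page), ...]).
    """
    by_visitor = defaultdict(list)
    for visitor, timestamp, page in entries:
        by_visitor[visitor].append((timestamp, page))

    sessions = []
    for visitor, hits in by_visitor.items():
        hits.sort()
        current = [hits[0]]
        for previous, hit in zip(hits, hits[1:]):
            if hit[0] - previous[0] > timeout:
                # Long pause: close the current session and start a new one.
                sessions.append((visitor, current))
                current = []
            current.append(hit)
        sessions.append((visitor, current))
    return sessions

log = [
    ("v1", 0, "home"),
    ("v1", 60, "news"),
    ("v1", 4000, "home"),
    ("v2", 10, "home"),
]
sessions = extract_sessions(log)
```

Variables such as “session duration” or “number of requests” can then be derived from each extracted session.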

[0063] Each session by a visitor was characterized by a set of different variables, in particular “start time”, “session duration”, “number of requests”, “referrer”, “1st visited category”, “2nd visited category”, “3rd visited category” and “4th visited category”.

[0064] In addition, further variables (not shown in the figures) were specified, such as “does the visitor accept cookies”, “number of sessions that the visitor had already had up to the current session”, “number of pages retrieved in the last session”, “interval in time to the last session”, “on which page did the last session end”, “time of the first session by the visitor” and “weekday”.

[0065] Altogether, each session was characterized in this way on the basis of 18 different variables.

[0066] In order to determine the relative frequencies of the states of the variables, a naive Bayesian clustering model, as described above, was used.

[0067] The specified variables were therefore integrated in the statistical model. The statistical model was subsequently trained on the data contained in the Web log files to find good parameters for the model. The desired relative frequencies can then be read from the model.

[0068] The result of determining the relative frequencies of the states of the variables is displayed in FIG. 1. FIG. 1 shows different monitor windows in which the variables “start time”, “session duration”, “number of requests”, “referrer”, “1st visited category”, “2nd visited category”, “3rd visited category” and “4th visited category”, which describe the visitors to a Web site, are shown.

[0069] From FIG. 1 it can be seen in particular that

[0070] approximately 55% of the visitors visit the Web site during the afternoon or evening,

[0071] approximately 47% of the visitors remain less than 1 minute on the Web site,

[0072] approximately 34% of the visitors make only one request,

[0073] approximately 56% of the visitors do not have a referrer,

[0074] approximately 45% of the visitors start on the homepage,

[0075] approximately 57% of the visitors visit only 1 category,

[0076] approximately 74% of the visitors visit at most 2 categories and

[0077] approximately 85% of the visitors visit at most 3 categories.

[0078] After the statistical model had been trained by an EM learning method, the dependencies between the variables could also be studied.

[0079] As can be seen in FIG. 2, the behavior of, for example, those visitors that came from a specific referrer (referred to as Endemann below) was investigated. For this, the corresponding entry in the variable “referrer” was set at 100%. By using the statistical model, it could be determined within fractions of a second that, in particular, approximately 99% of these visitors first visit the homepage and that the predominant majority (approximately 96%) then immediately leave the Web site again.

[0080] FIG. 3 displays a more complicated request to the database. FIG. 3 shows different monitor windows of the variables under consideration, in which case the behavior of the visitors that call up the homepage first, then read the news and subsequently call up the homepage again is investigated. Here the corresponding entries in the variables “1st visited category”, “2nd visited category” and “3rd visited category” were set at 100%.

[0081] Again, it could be determined by means of the statistical model within fractions of a second that these visitors then predominantly either read the news again (approximately 37%) or left the Web site (approximately 36%). It can also be seen in FIG. 3 that approximately 89% of these visitors have no referrer.

[0082] In a corresponding way, a multitude of further requests to the database could be answered within a short period, in general within less than 1 second. For example, it could be tested what proportion of the visitors that come from a specific referrer make more than three page requests, how these visitors are distributed over the time of day and which of these visitors are returning visitors. It could also be tested how the traffic of those visitors starting with the homepage is distributed, i.e. what proportion of the visitors continue the session, or subsequently abort it, in which way.

[0083] Such a multitude of requests involving many different variables, for data of this size, can be handled more efficiently with the method according to the invention than with the conventional database techniques, particularly the OLAP methods. Conventional OLAP methods can nevertheless be used in addition if exact statements are to supplement the approximate statements gained from the statistical model. However, considerably longer response times must then be accepted.

[0084] To summarize, this invention, as opposed to the conventional database techniques, particularly the database reporting and OLAP methods, can answer statistical requests made to extensive databases approximately, and more efficiently, by using statistical models. This does not exclude the supplementary use of conventional techniques for evaluating databases in order to obtain exact statements, if required. By using a clustering model by means of which the database can be broken up into smaller clusters, it is possible to restrict requests very quickly (approximately or exactly) to the relevant clusters of a database. Once the evaluation has been restricted to clusters of the database, a renewed statistical evaluation of these clusters can be carried out according to the invention, in the course of which, if required, a renewed restriction to subclusters contained in these clusters, as well as a renewed statistical evaluation of the data contained in the subclusters, can be made. In general, this procedure can be repeated as often as desired. In this way it is possible to create statistics or respond to statistical requests more efficiently.

[0085] Similarly, according to the invention, a clustering model based on a distance measurement can be used to subdivide the data of a database into many clusters, in which case the evaluation is restricted to the relevant clusters of the database. In order to determine the relative frequencies and expected values of the states of variables, conventional database reporting methods or OLAP methods are used.

[0086] In principle, this invention can be used wherever an efficient statistical evaluation of large amounts of data is required.

[0087] A possible application is therefore in the Web reporting/Web mining area, as has already been shown in the embodiment.

[0088] Further possible applications can, for example, be found wherever customer data is obtained in large amounts, such as:

[0089] data from call centers,

[0090] data from operational customer relationship management systems,

[0091] data from the health area,

[0092] data from medical databases,

[0093] data from environmental databases,

[0094] data from genome databases,

[0095] data from the financial area.

1. Method for the automatic, software-driven statistical evaluation of large amounts of data that is to be assigned to statistical variables in a database, in particular, contained in one or several clusters, which is characterized in that a statistical model for the approximate description of the relative frequencies of the states of the variables and the statistical dependencies between said states is learnt by means of the data stored in the database and is used to determine, on the basis of the statistical model, the approximate relative frequencies of states of the variables, in addition to the approximate relative frequencies belonging to the pre-determinable relative frequencies of states of the variables and expected values of the states of variables dependent thereon.
2. Method according to claim 1, characterized in that as the statistical model, a graphical probabilistic model, in particular a Bayesian network, is used.
3. Method according to claim 1, characterized in that a statistical clustering model, in particular a Bayesian clustering model, is used by means of which the data is subdivided into many clusters.
4. Method according to claim 1, characterized in that a clustering model based on a distance measurement is used by means of which the data is likewise subdivided into a plurality of clusters.
5. Method according to claim 3 or 4, characterized in that the considered data is restricted to the data contained in one cluster or a number of clusters.
6. Method according to claim 5, characterized in that the restriction is made to those clusters in which the data belonging to the specific states of variables has at least a specific relative frequency.
7. Method according to one of the claims 4 to 6, characterized in that the data belonging to a cluster is stored on a data carrier in a way appropriate to the cluster affiliation.
8. Method according to one of the previous claims, characterized in that database reporting methods or OLAP methods are further used to determine the relative frequencies and expected values of the states of variables.
9. Method according to claim 8, characterized in that database reporting methods or OLAP methods are used if a test variable assumes or exceeds a predetermined value.
10. Method for the automatic, software-driven statistical evaluation of large amounts of data that is to be assigned to statistical variables in a database, in particular, contained in one or several clusters, which is characterized in that the data is subdivided into many clusters by a clustering model based on distance measurement and, if required, the considered data is restricted to the data contained in one cluster or several clusters, and database reporting methods or OLAP methods are used to determine the relative frequencies and expected values of the states of variables.
11. Application of the method according to one of the previous claims for the statistical evaluation of customer data, in particular, in the Web reporting/Web mining area and in customer relationship management systems.
12. Application of the method according to one of the previous claims for the statistical evaluation of environmental databases, medical databases or genome databases.